Introduction to Open Data Science - Course Project

Week 1: Warm up

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Nov 22 16:26:48 2021"

“How are you feeling right now? What do you expect to learn? Where did you hear about the course?”

I am extremely excited about this course! I have just started my PhD studies at the Institute of Atmospheric and Earth System Research. My research topic is related to building a framework for climate competencies, based mostly on interviews and surveys. Since my background is in the natural sciences, I have little to no experience in behavioral sciences and their methodology. For this reason, I have chosen this course. It was suggested to me by our colleagues in the faculty of educational sciences.


I expect to learn to use R, which I’m not too worried about since I have experience in Python, and even more importantly, the statistical practices and rules that apply in human sciences, when it’s not all about nature and numbers.


My repository

As text: https://github.com/joulasip/IODS-project.git


Week 2: Regression and model validation

date()
## [1] "Mon Nov 22 16:26:48 2021"


1 Description of the data


The data includes survey results from statistics course students in Finland regarding their learning. The data was collected in 2014-2015. After combining variables that measure the same dimension and excluding observations where the exam points are zero, the data consists of 166 observations and 7 variables listed here:

  • gender
  • age
  • attitude
  • deep approach (maximized understanding with a true commitment)
  • surface approach (memorizing without understanding)
  • strategic approach (applying strategy to maximize learning)
  • points (max points)

The three learning approaches combine 8 subscales that all include 4 questions:

  • deep: Seeking meaning, relating ideas, use of evidence
  • surface: lack of purpose, unrelated memorizing, syllabus-boundness
  • strategic: organized studying, time management
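As a sketch of the combining step described above, each approach score is the mean of its questions and zero-point rows are dropped. The question column names below (D03, D07, SU02, SU08) are made up for illustration; the real dataset uses its own question codes:

```r
# Toy sketch of the wrangling described above; D03, D07, SU02 and SU08 are
# hypothetical question columns, not the real survey codes.
toy <- data.frame(
  D03 = c(4, 3), D07 = c(5, 3),    # two "deep" questions
  SU02 = c(2, 4), SU08 = c(2, 2),  # two "surface" questions
  Points = c(20, 0)
)

# each approach is the mean of its own questions
toy$deep <- rowMeans(toy[, c("D03", "D07")])
toy$surf <- rowMeans(toy[, c("SU02", "SU08")])

# exclude observations where the exam points are zero
toy <- subset(toy, Points > 0)
```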

Reading the data and checking the structure (working directory being IODS-project):

data <- read.csv("data/learning2014", row.names = 1)
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...


2 Graphical overview of the data


Summary of the data — basic statistics of the variables:

summary(data)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Graphical overview using the libraries GGally and ggplot2. The first variable, gender, defines the colour, and alpha sets the transparency in the graphs so that overlapping data can be seen:

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
p <- ggpairs(data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

The correlation coefficient, corr, refers to the Pearson correlation by default: correlations equal to +1 or −1 correspond to data points lying exactly on a line. The stars at the end of the correlation value represent the p-value thresholds:

  • “***” p < 0.001
  • “**” p < 0.01
  • “*” p < 0.05
  • “.” p < 0.10
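The same correlation-plus-p-value pairing can be reproduced for any pair of variables with `cor.test()`; a small sketch with toy vectors (not the course data):

```r
# cor.test() reports the Pearson correlation together with its p-value,
# i.e. the numbers ggpairs() summarises with stars. Toy vectors:
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.0, 9.9)  # nearly linear in x

ct <- cor.test(x, y, method = "pearson")
ct$estimate  # close to +1: the points fall almost exactly on a line
ct$p.value   # well below 0.001, i.e. the "***" level
```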


Interpretation:

There are almost twice as many female students as male. The majority of the students are between 20 and 30 years old. Male students generally report a better attitude than female students.

Points and attitude have the highest positive correlation, 0.437, which is also significant (p < 0.001). The highest significant negative correlation can be seen between the surface and deep learning approaches with corr = -0.324 (for males up to -0.622). This is to be expected, since they could be considered opposites.

Strategic learning approach is slightly more common among female students than male.

Less significant negative correlation can be seen between surface and strategic learning approaches, corr = -0.161, and between surface learning approach and attitude, corr = -0.176.

The distribution of deep learning approach is shifted more towards higher values than with strategic or surface approach. The mean and median are approx. 3.6 — for strategic approach approx 3.1 and surface approach 2.8.

To me it seems odd that the points do not correlate with the deep learning approach at all. Could this mean that the exam/assignments do not measure deep learning? Or that the deep learning approach does not in fact lead to learning? There is weak evidence (although no significance is shown) that the surface learning approach correlates negatively and the strategic approach positively with the points.

3 Multiple regression


Since the points have a highly significant correlation with attitude and slightly significant correlations with the surface and strategic learning approaches, I have chosen those three as the explanatory variables for the target variable “points” in the multiple regression.

my_model_multi <- lm(points ~ attitude + stra + surf, data = data)
summary(my_model_multi)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08


The following formula describes the multiple linear model:

\(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i\)

where \(\beta_0\) is the intercept and \(\beta_1\) is the regression coefficient (Estimate) for attitude, \(\beta_2\) for the strategic approach and \(\beta_3\) for the surface approach. The coefficients measure the change in the mean response associated with a unit change in the corresponding explanatory variable. \(\varepsilon_i\) refers to the error term, assumed to have a normal distribution with zero mean and the same variance \(\sigma^2\) for all values of the explanatory variables (as explained in the course material book Multivariate Analysis for the Behavioural Sciences). Note that the Std. Error column in the summary is the standard error of each coefficient estimate, not the error term itself.


The residuals (the differences between the observed values of the target variable and the fitted values) have a median of approx. 0.52. The F-statistic gives a test of the omnibus null hypothesis that all the regression coefficients are zero; it is calculated by dividing the regression mean square by the residual mean square (RMS). The RMS gives an estimate of the variance \(\sigma^2\). Since the F-statistic here has a very low associated p-value (3.156e-08), it is very unlikely that all of the coefficients are zero, as can be expected.

The t-values are obtained by dividing the estimated regression coefficient by the standard error of the estimate, and the associated significance levels (Pr(>|t|) in the table) can indicate the importance of the explanatory variable.
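As a quick check, the attitude row of the summary above can be reproduced from the reported estimate and standard error (these are rounded values, so the result can differ slightly in the last digit):

```r
# t = Estimate / Std. Error; the two-sided p-value comes from the
# t distribution with 162 residual degrees of freedom.
estimate  <- 3.3952
std_error <- 0.5741
df        <- 162

t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df)

t_value  # approx. 5.91, matching the summary's t value for attitude
p_value  # approx. 2e-08, matching Pr(>|t|)
```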


According to the model, attitude has the strongest and most significant (p < 0.001) positive association with the points. Since only attitude is highly significant, I will try models with attitude and the surface approach, and attitude and the strategic approach, separately. The summary of the latter is shown below.

my_model_2 <- lm(points ~ attitude + stra, data = data)
summary(my_model_2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

4 Multiple R-squared and relationship between variables


\(R^2\) gives the proportion of variability in the target variable accounted for by the explanatory variables — in the first case approx. 21% of the variability in points is accounted for by these three explanatory variables: attitude, strategic learning approach and surface learning approach. \(R\) is a measure of the fit of the model, the multiple correlation coefficient.
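The definition of \(R^2\) can be verified on any linear model; a sketch on the built-in mtcars data (not the course data):

```r
# R^2 = 1 - RSS/TSS for a model with an intercept; checked on mtcars.
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(resid(fit)^2)                       # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2_manual <- 1 - rss / tss

all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```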


In the model with only attitude and the strategic learning approach as explanatory variables (shown above), the strategic approach has a p-value of 0.09, still below 0.1, so slightly better than before. In this model, according to \(R^2\), approximately the same proportion of the variability, 20%, is explained by the combination of attitude and strategic approach as in the previous model with three explanatory variables. This supports the conclusion that the surface learning approach contributes very little.


Let’s still see how the model looks with only one explanatory variable, attitude. The resulting summary is shown below; attitude alone accounts for 19% of the variability in points.

my_model_3 <- lm(points ~ attitude, data = data)
summary(my_model_3)
## 
## Call:
## lm(formula = points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

It is important to note that the relationships between the explanatory variables can also have a major effect on the model, and they need to be taken into account when deciding which variables to use and which to leave out! Here the surface and strategic learning approaches do have a small negative correlation according to the overview of the data (part 2), but it is much less significant than the correlation between attitude and points, so I do not consider it important.

5 Diagnostic plots

I will continue the analysis with the model including attitude and strategic learning approach as the explanatory variables.

par(mfrow = c(2,2))
plot(my_model_2, which = c(1,2,5))


The graph above includes the diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs. Leverage.

Assumptions of the model:

  • Linearity
  • Errors are normally distributed, not correlated and have a constant variance, \(\sigma^2\)


The QQ-plot allows exploring the normality assumption: the better the points fall on the line, the better the assumption holds. The fit here is rather good, especially in the middle, but there seem to be some outliers at both ends of the distribution that fall below the line.


The constant variance of the errors, and their independence of the values of the explanatory variables, can be assessed with the Residuals vs. Fitted values graph. Any pattern in the plot would indicate issues with this assumption. Here the plot shows no clear pattern, but the same outliers are marked in this plot as well: observations 145, 56 and 35 seem odd. Based on these two plots, I would look more deeply into those answers and would consider removing those observations altogether if a clear reason for their behaviour is found.


The Residuals vs. Leverage graph helps to identify observations that have an unusually high impact on the model. Here again a few observations stand out and are marked with numbers; 145 and 35 appear again. Since Cook’s distance is somewhat elevated for these points, they have a rather large influence on the model and should therefore be examined more closely.
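The influence of individual observations can also be listed numerically with `cooks.distance()`. A sketch on a toy model built from the built-in mtcars data (for the real analysis, `fit` would be replaced with `my_model_2`):

```r
# Cook's distances, one per observation; large values flag influential points.
fit <- lm(mpg ~ wt, data = mtcars)
cd <- cooks.distance(fit)

# the three most influential observations, by name
head(sort(cd, decreasing = TRUE), 3)

# a common rough cut-off is 4/n
names(which(cd > 4 / nrow(mtcars)))
```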


Overall the diagnostic plots still look quite good and the model can be considered valid. The same outliers are also present if the diagnostic plots are made for the model with only attitude as the explanatory variable. Removing the outliers could lead to higher significance and clearer model results.



Week 3: Logistic regression

date()
## [1] "Mon Nov 22 16:26:54 2021"


1 Description of the data (1p)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(boot)


alc <- read.csv("data/alc", row.names = 1)
dim(alc)
## [1] 370  51
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

The data represents Portuguese secondary school students’ performance including grades, demographic, social and school related features. It was collected by using school reports and questionnaires. Two subjects are combined here: Mathematics and Portuguese language. More information can be found here including explanations of all the variables listed above: https://archive.ics.uci.edu/ml/datasets/Student+Performance

‘alc_use’ combines weekday (Dalc) and weekend (Walc) alcohol use. ‘high_use’ is defined as TRUE if ‘alc_use’ > 2, so a student’s alcohol consumption is defined as high if they use alcohol roughly twice a week or more.
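A base-R sketch of how these two variables are defined (toy rows only; the real wrangling script works on the full 370 observations):

```r
# alc_use averages weekday and weekend use; high_use flags averages above 2.
toy <- data.frame(Dalc = c(1, 3, 2), Walc = c(2, 5, 2))

toy$alc_use  <- (toy$Dalc + toy$Walc) / 2
toy$high_use <- toy$alc_use > 2

toy$high_use  # FALSE TRUE FALSE (a 2-2 answer averages to exactly 2, not high use)
```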


2 Hypothesis (1p)

Variables that I assume to have an important correlation with alcohol consumption, and my hypotheses about the connections:

  • G3: final grade (numeric: from 0 to 20)
    • “final grade is negatively correlated with alcohol consumption”
  • goout - going out with friends (numeric: from 1 - very low to 5 - very high)
    • “going out with friends correlates strongly positively with alcohol consumption”
  • absences - number of school absences (numeric: from 0 to 93)
    • “absences can be a consequence of high alcohol consumption”
  • age - student’s age (numeric: from 15 to 22)
    • “younger students use more alcohol”

In addition, I will include gender as a divider in the graphs.


3 Numerical and Graphical exploration (5p)

Glimpse to the selected data again:

#select the wanted variables to check
sel_alc <- select(alc,one_of(c("G3","goout","absences","age")))
glimpse(sel_alc)
## Rows: 370
## Columns: 4
## $ G3       <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14, 1…
## $ goout    <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2, 5…
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1, …
## $ age      <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1…


Bar plots of the chosen variables:

gather(sel_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

A small number of absences is common, and only very few students have more than 10 absences. Most students are between 15 and 18 years old and only a few are older. Grades seem to follow a normal distribution that is somewhat shifted towards better scores: 10 and 12 are the most common grades. Going out with friends is also quite normally distributed; most students responded with 3 or 2.



Making box plots of the variables as high use in the x-axis and gender defining the colour:

library(cowplot)

# initialize the plots for the 4 variables chosen
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("grade") + xlab("high use") + ggtitle("Student grade by alcohol consumption and sex")
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + xlab("high use") + ggtitle("Student absences by alcohol consumption and sex")
g3 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("going out") + xlab("high use") + ggtitle("Going out by alcohol consumption and sex")
g4 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + xlab("high use") + ggtitle("Student age by alcohol consumption and sex")

plot_grid(g1,g2,g3,g4)


Producing summary statistics for each selected variable based on high use of alcohol and gender:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       11.4
## 2 F     TRUE        41       11.8
## 3 M     FALSE      105       12.3
## 4 M     TRUE        70       10.3

Grades: I hypothesized that grades would be negatively affected by alcohol use. According to the box plot and the mean grades shown above, this holds only for male students: males with low alcohol consumption have a higher average grade than females (regardless of alcohol use), and males with high consumption have a notably lower average grade than females. The box plot shows that some high-consuming male students have received very low grades, which pulls the average down considerably. For female students the variance is smaller and no outliers are present (see box plot).


alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_absences
##   <chr> <lgl>    <int>         <dbl>
## 1 F     FALSE      154          4.25
## 2 F     TRUE        41          6.85
## 3 M     FALSE      105          2.91
## 4 M     TRUE        70          6.1

Absences: I assumed that higher alcohol use would lead to absences. For both genders this seems to hold. Male students not using much alcohol have a very low average number of absences, only 2.9, so the difference from the high alcohol users is very clear, approx. 3 more days of absence. For female students the difference is approx. 2.6 days. Female students have more absences in general, and there are several students with a much higher number of absences than the average, all the way to more than 40.


alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_goout
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       2.95
## 2 F     TRUE        41       3.39
## 3 M     FALSE      105       2.70
## 4 M     TRUE        70       3.93

Going out: The box plot for going out looks a bit odd due to the Likert scale. The median among students with low alcohol use is 3, and among those using alcohol more than twice a week it is 4 (for both male and female). Here the hypothesis also holds. The difference between the mean values is larger for male students according to the summary statistics above. Students who go out more also consume more alcohol, which makes a lot of sense.


alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_age
##   <chr> <lgl>    <int>    <dbl>
## 1 F     FALSE      154     16.6
## 2 F     TRUE        41     16.5
## 3 M     FALSE      105     16.3
## 4 M     TRUE        70     16.9

Age: The mean age of those using alcohol many times a week is slightly lower for female students and higher for male students than that of students with low use. This shows clearly in the medians in the box plot. Female students perhaps start drinking earlier than male students. The hypothesis that younger students drink more holds only to some extent; the students are generally quite young, as shown in the bar plots earlier.



4 Logistic regression (5p)

Statistically exploring the data with logistic regression:

m <- glm(high_use ~ age + absences + G3 + goout, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + absences + G3 + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8967  -0.7525  -0.5409   0.9467   2.2946  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.77834    1.96186  -1.926  0.05412 .  
## age          0.04563    0.11022   0.414  0.67890    
## absences     0.07315    0.02237   3.270  0.00108 ** 
## G3          -0.04472    0.03923  -1.140  0.25435    
## goout        0.70668    0.11917   5.930 3.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 388.67  on 365  degrees of freedom
## AIC: 398.67
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)         age    absences          G3       goout 
## -3.77834234  0.04562870  0.07314823 -0.04471930  0.70667982
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR        2.5 %   97.5 %
## (Intercept) 0.02286056 0.0004637374 1.033922
## age         1.04668570 0.8430071337 1.299954
## absences    1.07589001 1.0312932049 1.127070
## G3          0.95626587 0.8852281993 1.032878
## goout       2.02724925 1.6139414455 2.577757

Odds ratios (OR) describe the odds of high alcohol use associated with each explanatory variable. If the value is over 1, the explanatory variable has a positive effect on the target variable (high use). In this case, age, absences and going out increase the chances of high alcohol use, while grades have a negative association with it. However, as the significance tests in the model summary show, only absences and going out are statistically significant in the model.

If the value 1 lies within the confidence interval (CI), there is no evidence of an association between the variables. In this case, only the confidence intervals of absences and going out do not include the value 1, which means that only those two variables are associated with high alcohol use.
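To illustrate the scale these numbers live on, here is a sketch using the coefficients from the summary above. The linear-predictor example is purely illustrative, since setting the other variables to zero (including age) is not realistic:

```r
# Coefficients of a logistic model are log-odds; exponentiating gives the OR.
b_goout <- 0.70668
exp(b_goout)  # approx. 2.03: one step up the goout scale roughly doubles
              # the odds of high alcohol use

# plogis() maps a linear-predictor value to a probability; here the intercept
# plus the goout term at its maximum (5), other terms set to 0 for illustration
plogis(-3.77834 + 0.70668 * 5)  # approx. 0.44
```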

When it comes to the hypothesis before, this goes together with the previous analysis based on box plots. Absences have a clear positive correlation with the high alcohol use, as does going out with friends. Based on the previous, this model would look different regarding the grades if it was made for male students separately, whose grades were notably lower if they were using alcohol more than twice a week.


5 Predictive power of the model (3p)

For this exploration I will include the variables ‘absences’ and ‘goout’ according to the analysis above. Here I will use the model to predict the actual values, and save and compare the predictions to the actual values. If the predicted probability is larger than 0.5, high use is considered TRUE.

m2 <- glm(high_use ~ absences + goout, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, goout, high_use, probability, prediction) %>% tail(10)
##     absences goout high_use probability prediction
## 361        7     3     TRUE  0.28963176      FALSE
## 362        3     3     TRUE  0.23104046      FALSE
## 363        2     1     TRUE  0.06029226      FALSE
## 364        4     4     TRUE  0.40315734      FALSE
## 365        3     2    FALSE  0.12606082      FALSE
## 366        4     3     TRUE  0.24487648      FALSE
## 367        0     2    FALSE  0.10291931      FALSE
## 368        4     4     TRUE  0.40315734      FALSE
## 369        8     4     TRUE  0.47825016      FALSE
## 370        0     2    FALSE  0.10291931      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   236   23
##    TRUE     65   46

The table of the last ten values shows that the model predicted all of these cases as not having high alcohol consumption based on absences and going out, while the actual observations show high alcohol use in 7/10 cases. This model is clearly not perfect. Still, the table of target variable vs. predictions shows that most cases are predicted correctly by the model.


Then I plot the predictions against the actual values (graphical visualization). If the model were perfect at predicting the actual values, the upper line would have dots only on the right side (probability > 0.5) and the lower line only on the left side.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63783784 0.06216216 0.70000000
##    TRUE  0.17567568 0.12432432 0.30000000
##    Sum   0.81351351 0.18648649 1.00000000

According to the table, approximately 24 % of the predictions went wrong (the training error).
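The training error can be recomputed directly from the counts in the first confusion table: wrong predictions are the off-diagonal cells.

```r
# 2x2 table of actual high_use vs. prediction, copied from the output above
conf <- matrix(c(236, 65, 23, 46), nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))

training_error <- (conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]) / sum(conf)
round(training_error, 3)  # 0.238, i.e. roughly 24 % of predictions wrong
```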


6 10-fold cross-validation (Bonus 2p)

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2513514

The average proportion of wrong predictions in the cross-validation is around 0.24-0.26 (it changes with each run). Therefore the model’s prediction capability is very close to that of the model at Datacamp.
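The run-to-run variation comes from the random fold assignment inside `cv.glm()`; fixing the RNG seed before each call makes a run reproducible. A self-contained sketch on simulated data (the course analysis would use `alc` and its fitted model instead):

```r
library(boot)

# toy logistic data
set.seed(123)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(toy$x))

fit <- glm(y ~ x, data = toy, family = "binomial")
loss <- function(class, prob) mean(abs(class - prob) > 0.5)

# same seed before each call => same folds => same CV error
set.seed(42)
cv1 <- cv.glm(data = toy, cost = loss, glmfit = fit, K = 10)$delta[1]
set.seed(42)
cv2 <- cv.glm(data = toy, cost = loss, glmfit = fit, K = 10)$delta[1]
identical(cv1, cv2)  # TRUE
```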


END OF WEEK 3!


Week 4: Clustering and Classification

date()
## [1] "Mon Nov 22 16:26:56 2021"


1 Description of the data (1p)

The data represents Housing Values in Suburbs of Boston from the 1970s. Loading the data and taking a first look:

library(dplyr)
library(ggplot2)
library(tidyr)
library(boot)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.92 loaded
# load the data
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
dim(Boston)
## [1] 506  14

There are 506 observations and the following 14 variables:

  • crim: per capita crime rate by town
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.


2 Graphical overview of the data (2p)

Looking at the summary of the data:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00


Making a correlation matrix and visualizing it:

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

Based on the visual overview the strongest negative correlations are between:

  • nitrogen oxides (NOx) concentration and the weighted mean distance to the five employment centres
  • proportion of units built before 1940 and the weighted mean distance to the five employment centres
  • lower status of the population and the median value of owner-occupied homes

This suggests that the employment centres produce NOx emissions, which makes sense if they are based on heavy industry. The second is also clear: areas with older buildings were built close to the workplaces, and newer buildings lie further away since the closer land is already occupied. A larger share of lower-status population unsurprisingly correlates with a lower value of owner-occupied homes.

And the strongest positive correlations are between:

  • accessibility to radial highways and the full-value property-tax rate
  • NOx concentration and the proportion of non-retail business acres per town

These fit the picture above: the highly accessible, heavily taxed areas are the dense urban and industrial ones, and industrial (non-retail) land use emits nitrogen oxides.
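The strongest pairs can also be extracted programmatically instead of reading them off the plot; a small base-R sketch (assuming cor_matrix from above):

```r
# turn the correlation matrix into a long table of variable pairs
cor_long <- as.data.frame(as.table(cor_matrix))

# keep each pair only once and drop the diagonal
cor_long <- subset(cor_long, as.character(Var1) < as.character(Var2))

# the five pairs with the largest absolute correlation
head(cor_long[order(-abs(cor_long$Freq)), ], 5)
```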


3 Standardizing the data and categorizing crime rate (2p)

# centre and standardize the variables
boston_scaled <- scale(Boston)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)

The mean of each variable is now zero, since standardizing subtracts the column mean from every value and then divides by the column standard deviation.
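As a sanity check, one column can be standardized by hand and compared with the scale() result (crim is just an arbitrary example column):

```r
# subtract the column mean and divide by the column standard deviation
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

# matches the scale() output
all.equal(as.numeric(boston_scaled$crim), manual_crim)  # TRUE
```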


Next I will create a factor variable of the crime rate, remove the original continuous variable, and split the data into train and test sets for evaluating predictions.

# create quantile break points of crim
bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))

# removing the old crime rate variable
boston_scaled <- dplyr::select(boston_scaled, -crim)

# adding the new categorized crime rate variable to the standardized dataset
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows 
n <- nrow(boston_scaled)

# choosing randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# creating the train set
train <- boston_scaled[ind,]

# ... and the test set 
test <- boston_scaled[-ind,]
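Because sample() draws the rows randomly, the split (and every result downstream) changes on each knit. Fixing the RNG seed before sampling makes the analysis reproducible; a sketch (the seed value 2021 is an arbitrary choice):

```r
# fix the random number generator so the 80/20 split is identical on every run
set.seed(2021)
n <- nrow(boston_scaled)
ind <- sample(n, size = round(n * 0.8))
```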

4 Linear discriminant analysis (3p)

Then I will fit a linear discriminant analysis to the training data, with the crime rate category as the target variable and all the other variables as predictors.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2524752 0.2351485 0.2574257 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       0.88711774 -0.9227613 -0.157656245 -0.8765262  0.50233351 -0.8455100
## med_low  -0.09777286 -0.2993349 -0.002135914 -0.5614436 -0.07205145 -0.3698962
## med_high -0.37711365  0.2397391  0.142102536  0.4438409  0.16888337  0.4160529
## high     -0.48724019  1.0149946 -0.045188669  1.0221915 -0.39952243  0.8083092
##                 dis        rad        tax    ptratio       black       lstat
## low       0.7884350 -0.7008870 -0.7738451 -0.4718608  0.37648358 -0.76919504
## med_low   0.3117408 -0.5573888 -0.5041183 -0.1184852  0.31977281 -0.17780689
## med_high -0.4358222 -0.4112644 -0.2847119 -0.3971204  0.08086429 -0.07361578
## high     -0.8549995  1.6596029  1.5294129  0.8057784 -0.79176554  0.86045092
##                 medv
## low       0.59413061
## med_low   0.05442224
## med_high  0.21675404
## high     -0.66891024
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.12123030  0.610895304 -0.97397635
## indus    0.04565686 -0.360594257  0.30828851
## chas    -0.08925399 -0.041333545  0.22793753
## nox      0.28083238 -0.796605708 -1.25870966
## rm      -0.12017943 -0.117179919 -0.17463934
## age      0.23895204 -0.232733101 -0.23843974
## dis     -0.10529709 -0.116187535  0.10670844
## rad      3.71309876  0.882793731 -0.05707155
## tax      0.08214568 -0.021721523  0.38126064
## ptratio  0.10071342  0.184665108 -0.25254036
## black   -0.14799936  0.001525138  0.13938999
## lstat    0.15050703 -0.063422576  0.52012330
## medv     0.16359951 -0.261202949 -0.11244441
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9614 0.0290 0.0096


The following produces the LDA biplot:

# making arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plotting the results with arrows
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

5 Prediction of classes (3p)

The correct classes from the test data are saved and then removed here:

# saving the correct classes from test data
correct_classes <- test$crime

# lastly remove the crime variable from test data
test <- dplyr::select(test, -crime)


Then I will use the LDA model for predicting the classes, and cross tabulate the results with the categories from the test set.

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       6        0    0
##   med_low    2      16        6    0
##   med_high   1      17       11    2
##   high       0       0        1   22


The category ‘high’ is predicted very well by the model. The weakest category here is ‘med_high’, whose observations are in most cases predicted as ‘med_low’. In general the model seems pretty good, but because the test and train sets are chosen randomly, the fitted model changes somewhat with every run and the results vary as well. An average over many runs would be needed for a more precise estimate of the model’s performance.
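The overall accuracy can also be computed from the diagonal of the cross table; a sketch (assuming correct_classes and lda.pred from above):

```r
# proportion of test observations whose predicted class equals the true class
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
```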


6 K-means (4p)

I will load the Boston data again and standardize it, after which I will calculate the Euclidean distances between the observations. The variables are on very different scales, so both the distance matrix and the clustering below are computed on the standardized data.

data('Boston')
boston_scaled2 <- as.data.frame(scale(Boston))

# calculating the Euclidean distance matrix on the standardized data
dist_eu <- dist(boston_scaled2)
summary(dist_eu)


Next the k-means algorithm with 2 clusters is run and the results visualized (only columns 6-10 are plotted so the results are easier to see):

km <- kmeans(boston_scaled2, centers = 2)

# plot the standardized Boston data coloured by cluster
pairs(boston_scaled2[, 6:10], col = km$cluster)

Then 3 clusters are used instead of 2, which looks good and clear groups can be seen:

km <- kmeans(boston_scaled2, centers = 3)

# plot the standardized Boston data coloured by cluster
pairs(boston_scaled2[, 6:10], col = km$cluster)
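A common way to justify the number of clusters is the elbow method: run k-means over a range of k and look for the point where the total within-cluster sum of squares (WCSS) stops decreasing sharply. A sketch (assuming boston_scaled2 from above; the seed and the range 1-10 are arbitrary choices):

```r
set.seed(2021)

# total WCSS for k = 1..10 clusters on the standardized data
twcss <- sapply(1:10, function(k) kmeans(boston_scaled2, centers = k)$tot.withinss)

# a clear bend ("elbow") in this curve suggests a reasonable number of clusters
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```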


END OF WEEK 4!


(more chapters to be added similarly as we proceed with the course!)